Web Translation Mining Based on Suffix Arrays

Authors

  • Gaolin Fang
  • Hao Yu
Abstract

Mining translations from abundant Web data has applications in many fields, such as computer-assisted learning, machine translation, and cross-language information retrieval. The challenging issues are how to mine possible translations from the Web and determine the boundaries of candidates, and how to remove irrelevant noise and rank the candidates. In this paper, after reviewing and analyzing the possible methods of acquiring translations, we propose a statistical method based on suffix arrays to mine term translations from the Web. The proposed method can not only mine the different forms in which translations are distributed on the Web but also effectively obtain the correct boundaries of translations. Sort-based subset deletion and mutual information methods are then proposed to deal with, respectively, the subset redundancy and the affix redundancy that arise during estimation. Experiments on two test sets, one of 401 English-Chinese terms and one of 100 English-Japanese terms, indicate that the system performs well.
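
As a rough illustration of the kind of suffix-array statistics the abstract describes, the Python sketch below counts every substring up to a fixed length by scanning a suffix array, then removes candidates that are contained in a longer candidate with the same frequency. The function names, the toy input, and the equal-frequency rule for subset deletion are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation. The mutual information step for affix redundancy is not covered.

```python
from itertools import groupby


def build_suffix_array(text):
    """Return suffix start positions in lexicographic order.

    Naive construction (sorts full suffixes); fine for short snippets only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def substring_counts(text, max_len):
    """Count every substring of length 1..max_len via the suffix array.

    In a sorted suffix list, suffixes sharing a length-k prefix are adjacent,
    so each run of equal prefixes gives that substring's frequency.
    """
    sa = build_suffix_array(text)
    counts = {}
    for k in range(1, max_len + 1):
        # Skip suffixes shorter than k; the remaining order stays sorted.
        prefixes = (text[i:i + k] for i in sa if i + k <= len(text))
        for sub, run in groupby(prefixes):
            counts[sub] = sum(1 for _ in run)
    return counts


def drop_subset_candidates(counts, min_count=2):
    """Hypothetical subset deletion: process candidates longest first and
    drop any candidate contained in a kept candidate of equal frequency."""
    kept = []
    ordered = sorted(counts.items(), key=lambda kv: (-len(kv[0]), kv[0]))
    for sub, freq in ordered:
        if freq < min_count:
            continue  # too rare to be a candidate
        if any(sub in longer and freq == counts[longer] for longer in kept):
            continue  # redundant fragment of a longer candidate
        kept.append(sub)
    return kept
```

On the toy string "abcXabcYabc" with max_len=3, only "abc" is kept, because "ab", "bc", "a", "b", and "c" each occur exactly as often as "abc" and are therefore treated as redundant fragments.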

Similar resources

Web-Based Terminology Translation Mining

Mining terminology translations from large amounts of Web data has applications in many fields, such as reading/writing assistance, machine translation, and cross-language information retrieval. The challenging issues are how to find more comprehensive results from the Web and determine the boundaries of candidate translations, and how to remove irrelevant noise and rank the remaining candidates. In this...

Semi-Supervised Lexicon Mining from Parenthetical Expressions in Monolingual Web Pages

This paper presents a semi-supervised learning framework for mining Chinese-English lexicons from a large number of Chinese Web pages. The work is motivated by the observation that many Chinese neologisms are accompanied by their English translations in parentheses. We classify parenthetical translations into bilingual abbreviations, transliterations, and translations. A frequency-ba...

Distributed text search using suffix arrays

Text search is a classical problem in Computer Science, with many data-intensive applications. For this problem, suffix arrays are among the most widely known and used data structures, enabling fast searches for phrases, terms, substrings and regular expressions in large texts. Potential application domains for these operations include large-scale search services, such as Web search engines, wh...

Hierarchical Phrase-Based Translation with Suffix Arrays

A major engineering challenge in statistical machine translation systems is the efficient representation of extremely large translation rulesets. In phrase-based models, this problem can be addressed by storing the training data in memory and using a suffix array as an efficient index to quickly look up and extract rules on the fly. Hierarchical phrase-based translation introduces the added wrink...

Efficient Discovery of Proximity Patterns with Suffix Arrays

We describe an efficient implementation of a text mining algorithm for discovering a class of simple string patterns. Using an index structure for pattern discovery, called the virtual suffix tree, built on top of the suffix array, the resulting algorithm is simple and fast in practice compared with the previous suffix-tree implementation.

Journal:
  • Journal of Chinese Language and Computing

Volume 17

Publication date: 2007